Abstract

Discriminative patterns are association patterns that occur with disproportionate 
frequency in some classes versus others, and have been studied under names such as 
emerging patterns and contrast sets. Such patterns have demonstrated considerable 
value for classification and subgroup discovery, but a detailed understanding of the 
types of interactions among items in a discriminative pattern is lacking. To address 
this issue, we propose to categorize discriminative patterns according to four types 
of item interaction: (i) driver-passenger, (ii) coherent, (iii) independent additive and 
(iv) synergistic beyond independent additive. The coherent, additive, and synergistic
patterns are of practical importance, with the latter two representing a gain in the
discriminative power of a pattern over its subsets. Synergistic patterns are the most
restrictive, but perhaps the most interesting, since they capture a cooperative effect that
is more than the sum of the effects of the individual items in the pattern. For domains
such as biomedical and genetic research, differentiating among these types of patterns is
critical since each yields very different biological interpretations. For general domains,
the characterization provides a novel view of the nature of the discriminative patterns
in a dataset, which yields insights beyond those provided by current approaches
that focus mostly on pattern-based classification and subgroup discovery. This paper
presents a comprehensive discussion that defines these four pattern types and investigates
their properties and their relationship to one another. In addition, these ideas are
explored for a variety of datasets (ten UCI datasets, one gene expression dataset and
two genetic-variation datasets). The results demonstrate the existence, characteristics
and statistical significance of the different types of patterns. They also illustrate how
pattern characterization can provide novel insights into discriminative pattern mining
and the discriminative structure of different datasets.

1 Introduction



For data sets with class labels, association patterns [H [39] that occur with disproportionate
frequency in some classes versus others can be of considerable value. We will refer to them
as discriminative patterns [H HI [TOl [161 [13 132] in this paper, although these patterns have
also been investigated under various names, such as emerging patterns [15], contrast sets [1]
and supervised descriptive rules [32]. Discriminative patterns have been shown to be useful
for improving the classification performance [SI [IH 112] and for discovering sample subgroups.



The algorithms for finding discriminative patterns usually employ a measure for the
discriminative power of a pattern. Such measures are generally defined as a function of the
pattern's relative support^1 in the two classes, and can be defined either simply as the ratio
[15] or difference [4] of the two supports, or as other variations, such as information gain [8],
Gini index, or odds ratio [39].

To introduce some key ideas about discriminative patterns and make the following
discussion easier to follow, we use the measure that is defined as the difference of the supports
(DiffSup) of an itemset in the two classes (originally proposed in [1] and used by its
extensions [23, 31]). Consider Figure 1, which displays a sample dataset^2 containing 15 items
(columns) and two classes, each with 10 instances (rows). In the figure, four patterns (sets
of binary variables) can be observed: $A = \{i_1, i_2, i_3\}$, $B = \{i_5, i_6, i_7\}$,
$C = \{i_9, i_{10}\}$ and $D = \{i_{12}, i_{13}, i_{14}\}$. A, C and D are discriminative
patterns whose DiffSup is 0.6, 0.5 and 0.7 respectively. In contrast, B is not discriminative,
with a relatively uniform occurrence across the classes (DiffSup = 0).
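
As a concrete reading of the DiffSup measure, the following minimal Python sketch computes the per-class relative supports of an itemset on a small binary matrix. The toy data, the function names `support` and `diff_sup`, and the use of an absolute difference are illustrative assumptions; the matrix is not the one shown in Figure 1.

```python
from typing import Iterable, List, Sequence


def support(rows: Sequence[Sequence[int]], itemset: Iterable[int]) -> float:
    """Fraction of rows in which every item (column index) of `itemset` is 1."""
    items = list(itemset)
    if not rows:
        return 0.0
    hits = sum(1 for row in rows if all(row[i] == 1 for i in items))
    return hits / len(rows)


def diff_sup(class1, class2, itemset) -> float:
    """Absolute difference of the per-class relative supports (assumed form)."""
    return abs(support(class1, itemset) - support(class2, itemset))


# Hypothetical toy data: 10 instances per class, 3 items (columns 0, 1, 2).
# Items 0 and 1 co-occur in 6 instances of class 1 and never in class 2.
class1: List[List[int]] = (
    [[1, 1, r % 2] for r in range(6)] + [[0, 0, r % 2] for r in range(4)]
)
class2: List[List[int]] = (
    [[1, 0, r % 2] for r in range(6)] + [[0, 1, r % 2] for r in range(4)]
)

for pattern in ([0], [1], [0, 1], [2]):
    print(pattern, round(diff_sup(class1, class2, pattern), 3))
```

In this toy matrix the pair {0, 1} separates the classes far better than either item alone, which is the kind of incremental effect discussed for pattern A.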

Although A, C and D are all considered to be discriminative because of their large
DiffSup, several observations can be made about their different characteristics. First, one
of the two items in C has an individual DiffSup of 0.6 ($i_{10}$), while the other item ($i_9$)
has a DiffSup of 0. Given that C itself has a DiffSup of 0.5, it is clear that the discriminative
power of the pattern is mainly driven by $i_{10}$, while $i_9$ serves as a "passenger". Such
driver-passenger effects result from the fact that measures for discriminative power such as
DiffSup only capture the joint discrimination of a pattern but ignore the specific contribution
from the items in the pattern^3. Second, in contrast to C, the DiffSup values of the three
individual items in A are 0, 0.1 and 0.2, respectively, which are much lower than the DiffSup
of A itself (0.6). This suggests that the items in A have an incremental effect in their joint
discriminative power. Third, in contrast to both C and A, the three items in D have DiffSup
values (0.6, 0.7, 0.6) which are very similar to that of the pattern itself (0.7). Thus, the three
items in D, as well as their combination, show a coherent behavior in their ability to
differentiate between class 1 and class 2.

^1 Note that, in this paper, unless specified, the support of a pattern in a class is relative to the
number of transactions (instances) in that class, i.e. a ratio between 0 and 1.

^2 The discussion in this paper assumes that the data is binary. Nominal categorical data can be
converted to binary data without loss of information, while ordinal categorical data and continuous
data can be binarized, although with some loss of magnitude and order information.

^3 Such driver-passenger patterns often result when a discriminating, low-support item is combined
with a high-support, non-discriminating item. A similar issue exists in frequent pattern mining, where
a relatively low-support item can form trivial patterns with many high-support items.

[Figure 1 image: a binary data matrix with 20 instances (rows 1 to 20) and 15 items (columns
i1 to i15); the item groups forming patterns A, B, C and D are marked.]

Figure 1: A sample data set with three discriminative patterns (A, C, D) and an uninteresting
(non-discriminative) pattern (B).

Patterns A, C and D have shown some of the characteristics of different types of
interaction^4. Indeed, some characteristics of such interactions have been discussed and studied.
In particular, we can consider the discriminative power of a pattern as the confidence of
an association rule by considering the class label as a special item. The difference
between the confidence of an association rule and the confidences of its subsets has been
explored in the association rule mining community. Specifically, Bayardo et al. [5] proposed
a measure called improvement as the difference between the confidence of an association
rule (e.g. $Conf(X \rightarrow Y)$) and the maximal confidence of its simplifications (i.e.
$\max\{Conf(X' \rightarrow Y) \mid X' \subset X\}$). The association rules that have positive
improvement are called productive in [13] and are considered to be more desirable than those
rules with negative improvements. Similar approaches have also been proposed in the context of
discriminative patterns. Garriga et al. [H] studied the closeness of discriminative patterns and
proposed to remove a discriminative pattern (e.g. X differentiates class 1 from class 2 by
having higher support in class 1 than in class 2) if the support of X is identical to that of any
subset of X in class 2, because such patterns are guaranteed to have non-positive improvements.

^4 In this paper, we use interaction to denote the relationship among the items in an itemset, and
we use pattern interchangeably with itemset.

[Figure 2 panels: (a) With support difference. (b) With $\chi^2$ statistic. (c) With mutual
information.]

Figure 2: Comparing the discriminative power and maximal-subset discriminative power of a set of
patterns discovered from the UCI Hepatic dataset, with three different measures for discriminative
power. Each circle represents a pattern, with its color indicating pattern size (same for Figure 3(a)
and the later scatter plots).

To illustrate the concept of improvement and prepare for the following discussion, Figure
2 compares the discriminative power of a pattern with the best discriminative power of all its
subsets, for all the frequent patterns (minsup = 10%) in the Hepatic dataset (UCI [3]). Three
measures are used in subfigures (a), (b) and (c) respectively: support difference (DiffSup),
$\chi^2$-statistic, and mutual information. The red line indicates y = x, which separates the
patterns that have positive improvement from those that have negative improvement. A
common observation across the three subfigures is that most patterns have at
least one subset with higher discriminative power (negative improvement). In contrast,
a small proportion of patterns have much higher discriminative power than their
subsets (positive improvement). This contrast indicates that, although some combinations
of items have a reasonably high joint association with a class variable, the actual amount of
improvement can vary greatly from pattern to pattern.

As shown by existing studies [5, 13] as well as by Figure 2, adding constraints on improvement
can reduce the number of interesting association and discriminative rules substantially.
However, the current study of the different types of interactions in discriminative patterns
is lacking in the following respects:

1. The type of interaction captured by improvement is only one of several interesting
types of interactions. In other words, a discriminative pattern could have an interesting
interaction even if it has a close-to-zero or even a negative improvement value. For
example, pattern D shown in Figure 1 does not have an improvement of discriminative
power compared to that of its subsets, which may simply be due to the existence of
multiple redundant discriminative features. However, such coherent differentiation
of three items may still be interesting in certain domains. Specifically, in the field
of differential gene-expression module discovery, a discriminative pattern like D may
indicate a functional module or protein complex. A specific example will be given in
Section 2.2.

2. Even for the type of interactions captured by improvement, a further understanding of
the improvement in discriminative power is possible. For example, a large improvement
can either result from an independent additive aggregation of several items with separate
(unrelated) association with a class variable, or a synergistic aggregation beyond
the independent addition. Differentiating these different types of interactions (Section
2.3) can be useful for biomedical informatics because they generally lead to very different
types of interpretation for a disease-genetic association [12]. More generally, for other
real-life applications, understanding different types of positive improvements can help
us understand the discriminative structure of a dataset.

Aiming at a systematic understanding of the different types of interactions that are not
captured by existing work, we motivate, formulate and design comprehensive experiments
on the characterization of discriminative interactions from a general perspective of the
discriminative pattern mining community. The major contributions of the paper are:

1. We categorize discriminative patterns into four groups based on the following types 
of interactions: (i) driver-passenger, (ii) coherent, (iii) independent additive and (iv) 
synergistic beyond independent addition. 

2. We present and discuss the properties and utility of the four interaction types we define. 
We also discuss the relationship of the four pattern types to one another. 

3. We design comprehensive experiments on various types of real datasets including ten
UCI datasets, a gene expression dataset and two genetic variation (SNP) datasets. The
results demonstrate the existence, characteristics and statistical significance of the
different types of patterns. They also illustrate how pattern characterization can provide
novel insights into discriminative pattern mining and the discriminative structure of
different datasets.

The rest of the paper is organized as follows. In Section 2, we discuss different types of
interactions and define four types of discriminative patterns. In Section 3, we describe the
datasets and experimental results. Related work on discriminative pattern mining is briefly
summarized in Section 4, followed by conclusions and future work in Section 5.

2 Different Types of Interactions and the Corresponding
Groups of Discriminative Patterns

In this section, we describe four types of item interactions and categorize discriminative 
patterns into four groups correspondingly. We also investigate their properties and their 
relationship to one another.

First we describe some terminology that will be used through the rest of the section.

Let D be a dataset with a set of items, $I = \{i_1, i_2, \ldots, i_m\}$, two class labels + and -, and
a set of n labeled instances (transactions), $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \subseteq I$ is a
set of items and $y_i \in \{+, -\}$ is the class label for $x_i$. The two sets of instances that
respectively belong to the class + and - are denoted by $D^+$ and $D^-$, and we have
$|D| = |D^+| + |D^-|$. For an itemset $a \subseteq I$, the sets of instances in D, $D^+$ and $D^-$ that
contain a are denoted by $D_a$, $D_a^+$ and $D_a^-$ respectively. Let $p_a$, $p_a^+$ and $p_a^-$ be the
support of a in D, $D^+$ and $D^-$ respectively, all relative to the entire set of transactions, i.e.
$|D_a|/|D|$, $|D_a^+|/|D|$ and $|D_a^-|/|D|$. Let $p^+$ and $p^-$ be $|D^+|/|D|$ and $|D^-|/|D|$
respectively.

We use mutual information (MI) as a representative measure for discriminative power
among many others, such as the support ratio, support difference and $\chi^2$-statistic shown
in Section 1. This is because MI is based on information theory, which makes one of the
interaction measures to be presented later easy to interpret. The MI between an itemset a
and the class variable C is computed as follows:



where $q_a$, $q_a^+$ and $q_a^-$ are $1 - p_a$, $1 - p_a^+$ and $1 - p_a^-$ respectively. Note that, in
this paper, MI is always normalized by the entropy of the class variable ($H(C)$), after which it
ranges from 0 to 1.
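
To make the normalized measure concrete, here is a minimal sketch that computes the mutual information between a binary "itemset present" indicator and a binary class label from raw counts, and divides it by the class entropy. It assumes the standard definition of mutual information for two discrete variables; the function name and argument layout are hypothetical, and whether the expansion matches the paper's Equation 1 term by term has not been checked.

```python
import math


def normalized_mi(n_pos_with: int, n_pos: int, n_neg_with: int, n_neg: int) -> float:
    """MI between itemset presence and class label, divided by H(C).

    n_pos_with / n_neg_with: instances of class + / - that contain the itemset.
    n_pos / n_neg: total number of instances of class + / -.
    """
    n = n_pos + n_neg
    # Joint distribution over (itemset present?, class label).
    joint = {
        (1, "+"): n_pos_with / n,
        (0, "+"): (n_pos - n_pos_with) / n,
        (1, "-"): n_neg_with / n,
        (0, "-"): (n_neg - n_neg_with) / n,
    }
    p_class = {"+": n_pos / n, "-": n_neg / n}
    p_present = {
        1: joint[(1, "+")] + joint[(1, "-")],
        0: joint[(0, "+")] + joint[(0, "-")],
    }

    mi = 0.0
    for (present, label), p in joint.items():
        if p > 0:  # the term 0 * log(0) is taken as 0
            mi += p * math.log2(p / (p_present[present] * p_class[label]))

    h_class = -sum(p * math.log2(p) for p in p_class.values() if p > 0)
    return mi / h_class if h_class > 0 else 0.0


print(normalized_mi(10, 10, 0, 10))  # itemset present in all of class + only
print(normalized_mi(5, 10, 5, 10))   # itemset equally frequent in both classes
```

An itemset that perfectly separates the classes reaches the upper end of the range, and one that is equally frequent in both classes reaches the lower end.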

2.1 Driver-passenger Interaction (T1)

Pattern C shown in Section 1 is an illustration of discriminative patterns with a
driver-passenger interaction, where the driver and the passenger are both a single item in the
pattern. More generally, any discriminative pattern in which one subset has discriminative
power similar to that of the entire pattern, while another disjoint subset shows weak
discriminative power, is considered to have a driver-passenger interaction. Formally, we
define the discriminative patterns with this type of interaction (T1) as follows:

Definition 1: An itemset a is a T1 discriminative pattern if the following criteria are
met together for $\delta > 0$, $j > 0$, $\epsilon > 0$:

$$
\begin{aligned}
&\text{(a) } MI(a, C) > \delta, \\
&\text{(b) } \exists\, a' \subset a,\ |MI(a, C) - MI(a', C)| < j, \\
&\text{(c) } \exists\, a'' \subseteq (a - a'),\ MI(a'', C) < \epsilon.
\end{aligned}
\tag{2}
$$

Criterion (a) is a general requirement of the discriminative power of an itemset, which will
also be used in the definition of the other types of discriminative patterns. Criteria (b) and
(c) require the existence of at least one driver and at least one passenger in a, respectively.
Similar to C in Figure 1, T1 discriminative patterns are generally not interesting because
the passengers are included in a pattern as a purely mathematical consequence rather than
an interpretable relationship with the other items in the pattern. Thus, in the rest of the
paper, we will focus on the other three types of interactions that can serve as evidence of
a meaningful relationship among the items in a pattern.
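
The three criteria of Definition 1 can be sketched as a brute-force check over subsets. This is an illustration only: `mi` stands for any function returning the normalized MI of an itemset, the MI values in `mi_table` are hypothetical (not taken from the paper), and the passenger in criterion (c) is allowed to be the whole remainder of the pattern, as in the single-item case of pattern C.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, Iterator

Itemset = FrozenSet[str]


def nonempty_subsets(items: Iterable[str], proper: bool = False) -> Iterator[Itemset]:
    """Yield non-empty subsets of `items`; skip the full set if `proper` is True."""
    pool = sorted(items)
    largest = len(pool) - 1 if proper else len(pool)
    for size in range(1, largest + 1):
        for combo in combinations(pool, size):
            yield frozenset(combo)


def is_t1(a: Iterable[str], mi: Callable[[Itemset], float],
          delta: float, j: float, eps: float) -> bool:
    a = frozenset(a)
    if mi(a) <= delta:                        # criterion (a)
        return False
    for driver in nonempty_subsets(a, proper=True):
        if abs(mi(a) - mi(driver)) >= j:      # criterion (b) fails for this subset
            continue
        # criterion (c): some subset of the remaining items is a passenger
        if any(mi(p) < eps for p in nonempty_subsets(a - driver)):
            return True
    return False


# Hypothetical MI values for a two-item pattern shaped like C = {i9, i10}.
mi_table = {
    frozenset({"i9"}): 0.00,
    frozenset({"i10"}): 0.30,
    frozenset({"i9", "i10"}): 0.28,
}
print(is_t1({"i9", "i10"}, mi_table.__getitem__, delta=0.1, j=0.05, eps=0.01))
```

The enumeration is exponential in the pattern size, which is acceptable for the small patterns discussed here but would need pruning for larger ones.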



2.2 Coherent Interaction (T2) 

The illustrative pattern D in Figure 1 represents a type of interaction in which every item
in a pattern contributes a discriminative power similar to that of the entire pattern.
We call this a coherent interaction, and refer to patterns having this type of interaction as T2
patterns.

Definition 2: An itemset a is a T2 discriminative pattern if the following criteria are
met together for $\delta > 0$, $j > 0$:

$$
\begin{aligned}
&\text{(a) } MI(a, C) > \delta, \\
&\text{(b) } incoherence(a) < j, \\
&\text{(c) } \forall\, i \in a,\ direction(i) = direction(a).
\end{aligned}
\tag{3}
$$

The incoherence in criterion (b) is calculated as the range^5 of values in
$\{MI(a, C)\} \cup \{MI(i, C) \mid i \in a\}$. Criteria (a) and (b) capture the unique property of
this type of coherent interaction, i.e. each individual item in a pattern has discriminative
power similar (controlled by j) to that of the pattern itself. Given that MI does not indicate the
direction of the differentiation (i.e. a pattern or an item can be either more frequent in class +
or more frequent in class -), criterion (c) is further used to make sure that all the items in a
pattern have the same differentiating directionality as the pattern itself. Figure 3(a) illustrates
the existence of T2 discriminative patterns with a real gene expression dataset. Each circle
represents a pattern. The circles above the horizontal line meet criterion (a), and the circles on
the left of the vertical line meet criterion (b). Criterion (c) is implicitly enforced in the
generation of the figure. The circles in the upper-left corner are T2 discriminative patterns. Note
that the definition of the different types of interactions is with respect to the specified parameters
(here $\delta = 0.1$ and $j = 0.05$), rather than a clear-cut separation. With different parameter
values, a different set of patterns will be considered to have a certain interaction.
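
A minimal sketch of the T2 test follows, assuming that `mi` returns the normalized MI of an itemset and that `direction` returns the class in which an itemset is more frequent. Both callables, the gene names and all numeric values are hypothetical placeholders.

```python
from typing import Callable, FrozenSet, Iterable

Itemset = FrozenSet[str]


def incoherence(a: Iterable[str], mi: Callable[[Itemset], float]) -> float:
    """Range (max minus min) of the MI of the pattern and of its single items."""
    a = frozenset(a)
    values = [mi(a)] + [mi(frozenset({item})) for item in a]
    return max(values) - min(values)


def is_t2(a: Iterable[str], mi: Callable[[Itemset], float],
          direction: Callable[[Itemset], str], delta: float, j: float) -> bool:
    a = frozenset(a)
    same_direction = all(direction(frozenset({item})) == direction(a) for item in a)
    return mi(a) > delta and incoherence(a, mi) < j and same_direction


# Hypothetical values for a three-gene pattern with similar individual MIs.
pattern = frozenset({"g1", "g2", "g3"})
mi_table = {
    frozenset({"g1"}): 0.12,
    frozenset({"g2"}): 0.13,
    frozenset({"g3"}): 0.11,
    pattern: 0.12,
}
direction_table = {key: "+" for key in mi_table}  # all more frequent in class +

print(round(incoherence(pattern, mi_table.__getitem__), 3))
print(is_t2(pattern, mi_table.__getitem__, direction_table.__getitem__, 0.1, 0.05))
```

Flipping the direction of any single item makes criterion (c) fail even when the MI values are unchanged, which is why the direction check is kept separate from the incoherence bound.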

The essential difference between T1 and T2 discriminative patterns is that T1 patterns
include passengers (guaranteed by criterion (c) in Definition 1), while T2 patterns do
not include passengers (guaranteed by criterion (b) in Definition 2). This difference is
what distinguishes T1, an uninteresting type of discriminative pattern, from T2, a
potentially interesting type of discriminative pattern. Specifically, if a dataset has many T2
discriminative patterns, we can speculate that it contains features that are discriminative
and correlated with each other. Such correlation may either be due to the existence of
multiple discriminative features that are redundant with each other (uninteresting), or may
correspond to a functional module or protein complex that is associated with a disease in
the context of differential gene-module discovery. For instance, Figure 3(b) illustrates a T2
pattern discovered from the gene expression dataset of a study on breast cancer [41] (Section
3.1). The genes in the pattern (MI(a, C) = 0.10 and incoherence(a) = 0.02) demonstrate a
similar type of differentiating effect as C. Discovering such patterns rather than the
individual items separately could provide valuable insights towards the understanding of gene
interactions in complex diseases. Indeed, three genes in the pattern (BIRC5, Contig38901
and Contig41413) have been associated with breast cancer specifically^6, and the other one
(CCNB1) was identified as a general tumor-related gene [33]. These facts suggest that the
genes in the pattern may correspond to a functional module or protein complex.

^5 Difference between the maximal and the minimal value.

Figure 3: Illustration of T2 discriminative patterns on the gene expression dataset (described in
Section 3). (a) The entire set of discovered patterns; (b) visualization of the pattern in a binary
matrix format (black indicating 1's and white representing 0's, as in Fig. 1), with the horizontal
yellow line separating the two classes and the vertical green lines separating genes from each other.



2.3 Independent-Additive Interaction and Synergistic Interaction
beyond Independent Addition (T3 and T4)

In addition to coherent interaction, another type of interesting interaction in biomedical
and genetic domains is a pattern containing a set of items (e.g. genes) that has better
discriminative power than any of its subsets. Pattern A illustrates such an example, i.e.
the three individual items are not discriminative by themselves while they have a 100%
prediction confidence as a combination.

As discussed in Section 1, this type of interaction can be captured by existing measures
such as improvement, which is defined to be the difference between the discriminative power
(e.g. MI) of a pattern and that of its best subset. However, a deeper understanding of the
characteristics of the improvement in discriminative power is possible. For example, for
pattern A in Figure 1, the large improvement can either result from an independent additive
aggregation of several items with separate (unrelated) association, or a synergistic
aggregation beyond the independent addition. Differentiating these different types of interactions is
important because they generally lead to very different types of interpretation of a disease
association.

^6 www.genecards.org

Next, we will first discuss two different types of improvement interactions and then define 
another two types of discriminative patterns accordingly. 

2.3.1 Differentiating two types of improvement interactions 

Bayardo et al. [5] defined improvement in the context of association rule mining based on the
confidence of a rule. We first rewrite the improvement (Imp) in the context of discriminative
pattern mining based on MI as below:

$$ Imp(a) = MI(a, C) - \max_{a' \subset a} MI(a', C). \tag{4} $$

To motivate the different types of improvement, we consider the following equation
for a pair of items $a = \{i_a, i_b\}$:

$$ Imp(a) = MI(a, C) - \max(MI(i_a, C), MI(i_b, C)), \tag{5} $$

which is essentially the amount of additional information about the class variable C
that can be provided by the two items as a combination, compared to the information that
each item can provide (the bigger one). This additional amount of information can either
result from an independent additive aggregation of several items with separate (unrelated)
association, or a synergistic aggregation beyond the independent addition.

D. Anastassiou [2] applied a measure called synergy (originally used in the neuroscience
literature [19]) to discover gene-gene interactions that are beyond the independent addition of all
possible partitions of its subsets. In this paper, we leverage it to characterize discriminative
patterns from a more general perspective.

We start from the following equation for computing the synergy between a
size-2 pattern $a = \{i_a, i_b\}$ and a class variable C,

$$ Syn(a) = MI(a, C) - (MI(i_a, C) + MI(i_b, C)), \tag{6} $$

which is calculated as the amount of additional information about the class variable C
that can be provided by the two items as a combination, compared to the information that
each of the items can provide independently (sum of the two individual MIs). Compared to
Equation 5, the essential difference between improvement and synergy is that improvement
is with respect to the bigger MI of the two, while synergy is compared to the summation
of the MIs of the two. Indeed, the summation of the mutual information of two items is used
in information theory to represent the combined effect of two items with independent
association with a class variable [2]. Thus, synergy can be leveraged to refine the discriminative
patterns with positive improvement, based on the characteristics of an improvement.

Figure 4: Illustration of the mechanism of synergistic interaction in the context of yeast genetic
interaction (Figure taken from Costanzo et al. [13]).

In order to provide an intuitive understanding, Figure 4 illustrates an underlying
mechanism of synergistic interaction in the context of yeast genetic interaction. Two distinct
pathways are shown in the figure, i.e. A → B → C and X → Y → Z, which impinge on
a common biological process that is essential to the survival of a yeast cell (the wild type).
Due to the parallel structure, the two pathways can compensate for the loss of the other, and
thus a genetic perturbation (natural variations) on either of the two pathways separately
(e.g. perturbation only in A) fails to cause any observable defects in cell survival. However,
the simultaneous perturbations in A and Y disrupt both pathways and result in the lethality
of the cell. In this example, A and Y have a synergistic interaction with respect to the class
label (survival or not) of a cell.

Equation 7 gives the general definition of synergy for an itemset a beyond pairs (also
defined in [2]).

$$ Syn(a) = MI(a, C) - \max_{\text{all partitions into } \{S_i\}} \sum_i MI(S_i, C), \tag{7} $$

where a partition is defined as a collection $\{S_i\}$ of disjoint subsets $S_i$ whose union is a.
For example, for a size-3 pattern ($a = \{i_a, i_b, i_c\}$), the maximum is taken over the
following four sums:

$$
\max \left\{
\begin{array}{l}
MI(i_a, C) + MI(i_b, C) + MI(i_c, C) \\
MI(i_b, C) + MI(\{i_a, i_c\}, C) \\
MI(i_c, C) + MI(\{i_a, i_b\}, C) \\
MI(i_a, C) + MI(\{i_b, i_c\}, C)
\end{array}
\right\}
$$

This generalized definition is consistent with the intuition that synergy is the additional
amount of information about a class variable provided by an integrated discriminative power
compared with what can be best achieved after breaking the pattern into components by
the sum of the contributions of these components. The partition of the set of factors that
is chosen in this formula is the one that maximizes the sum of the amounts of mutual
information connecting the subsets in that partition with the class variable, and we will
refer to it as the best aggregated MI. Note that the computational complexity of synergy for
an itemset of size n is $O(B(n))$, where $B(n)$ is the n-th Bell number^7, which grows very
rapidly with n. In practice, to avoid unnecessary computations, we only
need to compute synergy (as well as best aggregated MI) for those patterns with positive
improvement, which is much more efficient to compute, i.e. $O(n)$.
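
The partition-based definition can be sketched by enumerating all set partitions directly. This is a brute-force illustration, not the authors' implementation: `mi` is any function returning the MI of an itemset, the MI values below are hypothetical, and the trivial one-block partition {a} is excluded on the assumption that it is not among the candidates (it is absent from the size-3 example above).

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterator, List, Sequence

Itemset = FrozenSet[str]


def set_partitions(items: Sequence[str]) -> Iterator[List[List[str]]]:
    """Yield every partition of `items` into non-empty disjoint blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        yield [[first]] + partition                       # `first` in its own block
        for k in range(len(partition)):                   # or joined to block k
            yield partition[:k] + [[first] + partition[k]] + partition[k + 1:]


def best_aggregated_mi(a: Sequence[str], mi: Callable[[Itemset], float]) -> float:
    if len(a) < 2:
        raise ValueError("synergy needs a pattern with at least two items")
    return max(
        sum(mi(frozenset(block)) for block in partition)
        for partition in set_partitions(list(a))
        if len(partition) >= 2                            # skip the trivial partition
    )


def synergy(a: Sequence[str], mi: Callable[[Itemset], float]) -> float:
    return mi(frozenset(a)) - best_aggregated_mi(a, mi)


# Hypothetical MI values: weak single items and pairs, strong triple.
items = ["x", "y", "z"]
mi_table = {
    frozenset(combo): value
    for size, value in ((1, 0.01), (2, 0.05), (3, 0.30))
    for combo in combinations(items, size)
}
print(len(list(set_partitions(items))))           # number of partitions of 3 items
print(round(synergy(items, mi_table.__getitem__), 3))
```

The number of partitions generated is the Bell number of the pattern size, so the enumeration is only practical for small patterns, in line with the complexity remark above.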

Given the definitions of improvement and synergy, it is easy to see that synergy is
guaranteed to be no larger than improvement (this follows from the fact that MI is
non-negative). Essentially, synergy is a more restrictive measure specifically for capturing
interaction beyond independent addition. Figure 5 compares how improvement and synergy
capture the interaction of discriminative patterns discovered from a gene expression dataset
(described in Section 3). Figure 5(a) shows the MI and best subset MI of the discriminative
patterns as a scatter plot. The horizontal dashed line indicates the cutoff value for MI, and
the other dashed line (representing y = x) separates the patterns with MI higher than best
subset MI (positive improvement) from those that have negative improvement. As shown,
there are quite a few patterns above both the horizontal line and y = x, with size ranging
from 2 to 5. In contrast to Figure 5(a), the x-axis in Figure 5(b) is best aggregated MI instead
of best subset MI. Corresponding to this difference, there are far fewer discriminative patterns
(all of size 2) that are above both the horizontal line (high discriminative power) and y = x
(positive synergy) at the same time. This contrast is as expected given our discussion above
that synergy is a more restrictive type of interaction beyond the independent additive effect,
and is guaranteed to be no more than improvement for any pattern.
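
The ordering between the two measures can be checked in one step. The following derivation sketch is not taken from the paper; it writes $a \setminus a'$ for the items of $a$ outside a non-empty proper subset $a'$ and uses only the non-negativity of MI:

```latex
\begin{aligned}
\max_{\{S_i\}} \sum_i MI(S_i, C)
  &\ge MI(a', C) + MI(a \setminus a', C)
  && \text{since } \{a',\, a \setminus a'\} \text{ is a partition of } a \\
  &\ge MI(a', C)
  && \text{since } MI \ge 0 .
\end{aligned}
```

Taking the maximum over all $a' \subset a$ shows that the best aggregated MI is at least the best subset MI, and therefore $Syn(a) \le Imp(a)$.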



2.3.2 Defining two different types of discriminative patterns with positive
improvement

With synergy, we can divide all discriminative patterns with positive improvement into two
groups, i.e. those that have negative synergy and those that have positive synergy.
Alternatively, the two groups can also be defined as those patterns that have positive improvement
(including both positive and negative synergy) and those that specifically have positive
synergy. We take the latter route, given its simplicity in terms of the definitions as shown below.
Note that the observations made from both routes are essentially the same.

^7 en.wikipedia.org

[Figure 5 panels: (a) Positive Imp (above y = x). (b) Positive Syn (above y = x).]

Figure 5: Illustration of general improvement and synergistic interaction beyond independent
additive effect on the gene expression dataset. Best aggregated MI is only computed for those
patterns that have positive Imp.

Definition 3: An itemset a is a T3 discriminative pattern if the following criteria are
met together for $\delta > 0$, $j > 0$:

$$
\begin{aligned}
&\text{(a) } MI(a, C) > \delta, \\
&\text{(b) } Imp(a) > j.
\end{aligned}
$$

Definition 4: An itemset a is a T4 discriminative pattern if the following criteria are
met together for $\delta > 0$, $j > 0$:

$$
\begin{aligned}
&\text{(a) } MI(a, C) > \delta, \\
&\text{(b) } Syn(a) > j.
\end{aligned}
$$

For illustration, we note that pattern A in Figure 1 is a T4 discriminative pattern,
with MI(A) = 0.39 and individual item MIs of 0.007, 0.008 and 0.029, respectively. The
MI-improvement is 0.18, while the synergy is 0.17.

If a dataset has many T3 discriminative patterns, we can speculate that it contains
features that complement each other for higher discriminative power in their association
with the class variable. Further, if there are also many T4 discriminative patterns, it is
expected that some features have a synergistic cooperative effect beyond independent addition.
In contrast, if a dataset has few or no T3 discriminative patterns, the discriminative features,
if they exist in the dataset, are expected to be either correlated with each other (T2) or not
form high-order combinations at all, i.e., have too low a joint frequency to pass the support
threshold.

Figure 6: Illustration of T3 and T4 discriminative patterns on the gene expression dataset. Syn
is only computed for those patterns that have positive Imp.

Figure 6 shows the two sets of patterns: T3 (upper right region in Figure 6(a)) and T4
(upper right region in Figure 6(b)) respectively, both with $\delta = 0.1$ and $j = 0.05$. Note that
there are only two patterns (size 2) that have synergy greater than j = 0.05. This again
indicates that the synergistic interaction in T4 patterns is rare. However, as will be shown in
Section 3, these two T4 patterns (although very rare) are statistically significant after correcting
for multiple hypothesis testing to control type I error (false discovery rate < 0.01), and thus
can be of significant interest in the biomedical domain. After all, j = 0.05 is an arbitrary
threshold that is used to illustrate the concept. In fact, there are many other discriminative
patterns with positive synergy (even though they are below 0.05), as shown in Figure 6(b),
which may also be interesting to specific domains.
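
Definitions 3 and 4 reduce to two threshold tests once MI, Imp and Syn are known. The sketch below assumes these three values have already been computed; the function name `classify` is hypothetical, the default thresholds are the illustrative values used for Figure 6, and the first call uses the values quoted in the text for pattern A.

```python
from typing import Set


def classify(mi_value: float, imp: float, syn: float,
             delta: float = 0.1, j: float = 0.05) -> Set[str]:
    """Return the subset of {"T3", "T4"} whose criteria the pattern meets."""
    labels: Set[str] = set()
    if mi_value > delta:          # criterion (a) of Definitions 3 and 4
        if imp > j:               # criterion (b) of Definition 3
            labels.add("T3")
        if syn > j:               # criterion (b) of Definition 4
            labels.add("T4")
    return labels


# Pattern A: MI = 0.39, improvement = 0.18, synergy = 0.17 (values from the text).
print(sorted(classify(0.39, 0.18, 0.17)))
# Hypothetical pattern with positive improvement but little synergy.
print(sorted(classify(0.20, 0.08, 0.01)))
```

Because synergy never exceeds improvement, any pattern labeled T4 under the same thresholds is also labeled T3.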



Figure 7 illustrates two example patterns, one for T3 and one for T4. In Figure 7(a),
the individual MIs of the two SNPs are 0.054 and 0.04, respectively. As a combination, the
pattern has an MI of 0.107, which is almost the same as the sum of the two individual MIs (a
low synergy of 0.013), indicating an independent additive effect and thus a T3 pattern. In
contrast, the two SNPs in Figure 7(b) have a high synergy of 0.108, indicating a large
cooperative effect beyond independent addition. Indeed, the two genes on which the two SNPs
are located, MSH6 and DPYD, are known to code for proteins with the following functions^1: (i)
recognizing mismatched nucleotides and (ii) catabolizing two specific types of nucleotides
(uracil and thymidine), respectively. The fact that they have a synergistic interaction agrees
with their closely related functions and potential compensation for each other, as illustrated
in Figure 7.
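
The numbers quoted for the two examples can be re-derived with a few lines of arithmetic. The sketch below assumes that, for a size-2 pattern, the improvement is the pair's MI minus the larger single-item MI and the synergy is the pair's MI minus the sum of the two single-item MIs. These size-2 forms are an assumption that is consistent with the values printed in Figure 7; the general definitions are given earlier in the paper. The function names are illustrative.

```python
def pair_improvement(mi_pair, mi_a, mi_b):
    """Gain of the pair over its better single item (assumed size-2 form)."""
    return mi_pair - max(mi_a, mi_b)


def pair_synergy(mi_pair, mi_a, mi_b):
    """Gain of the pair over the sum of its two single items (assumed size-2 form)."""
    return mi_pair - (mi_a + mi_b)


# MI values read from Figure 7: (MI of the pair, MI of each of the two SNPs).
examples = {
    "T3 example, Figure 7(a)": (0.107, 0.054, 0.040),
    "T4 example, Figure 7(b)": (0.160, 0.026, 0.026),
}
for name, (mi_pair, mi_a, mi_b) in examples.items():
    imp = pair_improvement(mi_pair, mi_a, mi_b)
    syn = pair_synergy(mi_pair, mi_a, mi_b)
    print(f"{name}: MI-Imp = {imp:.3f}, Synergy = {syn:.3f}")
```

Rounded to three decimals, the results agree with the MI-Imp and Synergy values shown in the figure (0.053 and 0.013 for the T3 example, 0.134 and 0.108 for the T4 example).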



^1 www.genecards.org

[Figure 7, panel (a): SLCO1A2 (rs4337089) and XRCC4 (rs2075685). MI: 0.107; MI-Imp: 0.053; Synergy: 0.013; individual MIs: 0.054, 0.040.]
[Figure 7, panel (b): MSH6 (rs2020911) and DPYD (rs1520663). MI: 0.160; MI-Imp: 0.134; Synergy: 0.108; individual MIs: 0.026, 0.026.]

(a) A T3 example. (b) A T4 example.

Figure 7: A T3 example and a T4 example, both discovered from the M-Survival SNP dataset
described in Section 3.1 (refer to Fig. 3(b) for a similar description).



2.4 The relationships among the four different types of interactions

In this subsection, we discuss the relationships among the different types of interactions and
relate other types of interactions to the four defined interactions, in order to build a
systematic understanding of item interactions in discriminative patterns.

Figure 8 shows the three interesting types of discriminative patterns (T2, T3 and T4)
in the context of all discriminative patterns using a Venn diagram. The outermost circle
contains all the discriminative patterns with MI > δ. The set of T3 discriminative patterns
is a superset of the set of T4 patterns, which follows from Definitions 3 and 4 (with the same
δ and j) and the fact that the synergy of any pattern is never greater than its improvement.
The set of T2 discriminative patterns is disjoint from the set of T3 patterns when the same
value of j is used in Definitions 2 and 3. Specifically, for any given value of j, criterion (b)
in Definition 2 and criterion (b) in Definition 3 cannot be met at the same time.
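
For size-2 patterns, the inequality between synergy and improvement can be checked directly. The derivation below is a sketch that assumes the size-2 forms of the two measures (improvement relative to the better single item, synergy relative to the sum of the two single items). These forms are consistent with the values reported in Figure 7 but are not restated in this subsection, and the symbols a and b denote two hypothetical items.

```latex
\begin{align*}
\mathrm{Imp}(\{a,b\}) &= MI(\{a,b\}) - \max\bigl(MI(a),\, MI(b)\bigr), \\
\mathrm{Syn}(\{a,b\}) &= MI(\{a,b\}) - \bigl(MI(a) + MI(b)\bigr), \\
\mathrm{Imp}(\{a,b\}) - \mathrm{Syn}(\{a,b\})
  &= MI(a) + MI(b) - \max\bigl(MI(a),\, MI(b)\bigr)
   = \min\bigl(MI(a),\, MI(b)\bigr) \;\ge\; 0 .
\end{align*}
```

Because mutual information is non-negative, the synergy never exceeds the improvement, so under these assumptions any size-2 pattern whose synergy exceeds j also has an improvement that exceeds j. For the two examples in Figure 7 the difference is 0.053 - 0.013 = 0.040 and 0.134 - 0.108 = 0.026, which is the smaller single-item MI in each case.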

The next natural question concerns the nature of the discriminative patterns that are not of
any of the three types (T2, T3 and T4), i.e., the region represented by the gray background
color. Indeed, they all fall into one of two cases: either (i) T1 patterns with the
driver-passenger interaction, or (ii) patterns that can each be considered a combination^2
of T2 and T3 patterns. Due to space limitations, the proof of this is available on the paper
website.

^2 For example, if P is a T2 pattern and Q is a T3 pattern, then P ∪ Q is a combination of T2
and T3 patterns.

Note that the goal of characterizing discriminative patterns with different types of
interactions is to identify different types of interesting discriminative patterns, which in
the context of this paper are specifically T2-T4. It is worth noting that we do not exclude the
possibility that the patterns in the gray region (T1 patterns, or combinations of T2 and
T3 patterns) may also be interesting in some specific domains, even though they are not
considered as such in this paper. Thus, the focus of this paper is to initiate a study of the
item interactions in discriminative patterns, rather than to identify all possible types of
interesting item interactions in discriminative patterns.

2.5 Correction for Multiple Hypothesis Tests 

As discussed in recent work [20, 22], an association pattern mining task (e.g., mining frequent
patterns or discriminative patterns) essentially conducts a large number of hypothesis tests.
Thus, in order to control type I error arising from multiple hypothesis testing, the
significance of the discovered patterns must be corrected. Among the different approaches to
correcting for multiple hypothesis testing, randomization-based approaches [20] are
non-parametric and thus more reliable in terms of not introducing bias. Randomization
frameworks have been extensively explored in the context of frequent pattern mining and
clustering [22]. For discriminative pattern mining, a special type of randomization procedure
is needed, in which the randomization is performed by shuffling the class labels of the
samples. For the details of the randomization and the calculation of the corrected p-value or
false discovery rate (FDR), refer to [17, ...]. In Section 3.2, we will show that many of the
discovered T2-T4 patterns are statistically significant after correcting for multiple
hypothesis tests.
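
The idea of shuffling class labels can be sketched for a single pattern as follows. This is a generic permutation test written for illustration, not the procedure of the cited references: the function names, the choice of MI in bits as the test statistic, and the add-one p-value estimate are all assumptions. A full correction for multiple tests would apply the same shuffles to the entire mining pipeline rather than to one pattern.

```python
import math
import random
from collections import Counter


def mutual_information(x, y):
    """MI in bits between two equal-length sequences of discrete values."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[a] * py[b]))
    return mi


def permutation_p_value(occurrence, labels, n_perm=1000, seed=0):
    """Empirical p-value of a pattern's MI under random class-label shuffles.

    occurrence[i] is 1 if the pattern occurs in sample i, else 0.
    labels[i] is the class label of sample i.
    """
    rng = random.Random(seed)
    observed = mutual_information(occurrence, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any pattern-class association
        if mutual_information(occurrence, shuffled) >= observed:
            hits += 1
    # Add-one estimate so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: a pattern that occurs exactly in the samples of one class.
occurrence = [1] * 20 + [0] * 20
labels = ["+"] * 20 + ["-"] * 20
mi, p = permutation_p_value(occurrence, labels)
print(f"MI = {mi:.3f} bits, empirical p-value = {p:.4f}")
```

Shuffling the labels preserves the class sizes and the pattern's support, so the null distribution reflects only the absence of an association between the pattern and the class.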

Dataset            | Size of (+ class) | Size of (- class) | # of items | Density | MI > δ | # of T2  | # of T3 | # of T4 | Fraction of T2-T4 patterns
-------------------|-------------------|-------------------|------------|---------|--------|----------|---------|---------|---------------------------
Breast (GEP)       | 217               | 78                | 11962      | 0.1662  | 1642   | 240 (29) | 13 (21) | 2 (4)   | 0.154
Lung (SNP)         | 96                | 99                | 8777       | 0.3855  | 6      | -        | 4 (4)   | 2 (3)   | 0.667
M-Survival (SNP)   | 70                | 73                | 8265       | 0.3325  | 62     | -        | 32 (42) | 16 (27) | 0.516
Chess (UCI)        | 1527              | 1669              | 73         | 0.4932  | 109    | -        | 5 (7)   | 1 (3)   | 0.046
Sonar (UCI)        | 111               | 97                | 42         | 0.5     | 19476  | 268 (37) | 21 (17) | -       | 0.015
Hepatic (UCI)      | 32                | 123               | 33         | 0.4561  | 3144   | 16 (11)  | 6 (7)   | -       | 0.007
Cleve (UCI)        | 165               | 138               | 27         | 0.4074  | 708    | 21 (12)  | 2 (4)   | -       | 0.032
Horse (UCI)        | 232               | 136               | 57         | 0.2234  | 106    | 1 (2)    | 1 (2)   | -       | 0.019
Adult (UCI)        | 11687             | 37155             | 94         | 0.1371  | 661    | 4 (6)    | -       | -       | 0.006
Crx (UCI)          | 307               | 383               | 50         | 0.2784  | 625    | 3 (6)    | -       | -       | 0.005
Hypo (UCI)         | 151               | 3012              | 50         | 0.4524  | 644    | 1 (2)    | -       | -       | 0.002
Mushroom (UCI)     | 3916              | 4208              | 118        | 0.1923  | 2334   | -        | -       | -       | -
Waveform (UCI)     | 1657              | 1647              | 102        | 0.1863  | 17     | -        | -       | -       | -

Table 1: Details of each dataset and a summary of the number of different types of discriminative patterns discovered. For
columns 7-9, in addition to the number of discovered patterns, we also show (in brackets) the number of unique items in the
union of the set of patterns to reflect the redundancy among the patterns. δ = 0.1 and j = 0.05 are used for all the datasets.
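
The last column of Table 1 appears to be consistent with the number of T2 patterns plus the number of T3 patterns, divided by the number of patterns in the "MI > δ" column (T4 patterns need not be added separately, since they form a subset of the T3 patterns). This reading is an inference from the reported numbers rather than a formula stated in the paper. The short check below treats empty cells as 0, and the helper name `interesting_fraction` is illustrative.

```python
# (dataset, # patterns above the MI threshold, # of T2, # of T3, reported fraction)
# Counts are copied from Table 1; empty cells are treated as 0.
rows = [
    ("Breast (GEP)", 1642, 240, 13, 0.154),
    ("Lung (SNP)", 6, 0, 4, 0.667),
    ("M-Survival (SNP)", 62, 0, 32, 0.516),
    ("Chess (UCI)", 109, 0, 5, 0.046),
    ("Sonar (UCI)", 19476, 268, 21, 0.015),
    ("Hepatic (UCI)", 3144, 16, 6, 0.007),
    ("Cleve (UCI)", 708, 21, 2, 0.032),
    ("Horse (UCI)", 106, 1, 1, 0.019),
    ("Adult (UCI)", 661, 4, 0, 0.006),
    ("Crx (UCI)", 625, 3, 0, 0.005),
    ("Hypo (UCI)", 644, 1, 0, 0.002),
]


def interesting_fraction(total, n_t2, n_t3):
    """Fraction of discriminative patterns that are T2 or T3 (T4 lies inside T3)."""
    return (n_t2 + n_t3) / total


for name, total, n_t2, n_t3, reported in rows:
    computed = round(interesting_fraction(total, n_t2, n_t3), 3)
    print(f"{name:18s} computed={computed:.3f} reported={reported:.3f}")
```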



3 Experiments 

In this section, we use a variety of real datasets to demonstrate the existence, properties and
statistical significance of the different types of discriminative patterns characterized in
Section 2. We also show how the characterization can provide novel insights into
discriminative pattern mining and the discriminative pattern structure of different datasets,
beyond those provided by current approaches that focus mostly on pattern-based classification
and subgroup discovery.

3.1 Data Sets 

We use the following three types of real datasets, with details summarized in Table 1 and
detailed pre-processing steps described on the paper website:

Ten UCI datasets [3] with a variety of dimensions and densities (Table 1).

A gene expression dataset on breast cancer [10], pre-processed as suggested in [...]
and binarized as done in [17, 29]. We denote this dataset by Breast (GEP).

Two single-nucleotide polymorphism (SNP) datasets. An SNP profile captures the
genetic variations of a person at single-nucleotide resolution, and such profiles are commonly
used in disease-association studies [...]. The diseases studied with these two datasets
are myeloma [6] and lung cancer [11], respectively. We denote these two datasets by
M-Survival (SNP) and Lung (SNP).

3.2 Experimental Results 

For each dataset, we first discover a set of discriminative patterns with existing algorithms.
Specifically, for the dense and high-dimensional datasets (Breast (GEP), the two SNP
datasets, Chess (UCI) and Hypo (UCI)), we leverage the SMP algorithm proposed in [17]
to discover discriminative patterns (SupMaxPair = 0.2). For the other sparse or
low-dimensional datasets, we simply use FPC [21] with minsup = 10%, because SMP may miss
some high-support patterns, although it is efficient at discovering discriminative patterns
from dense and high-dimensional data [17].

[Figure 9 panels: (a) M-Survival(SNP) T2; (b) M-Survival(SNP) T3; (c) M-Survival(SNP) T4; (d) Sonar(UCI) T2; (e) Sonar(UCI) T3; (f) Sonar(UCI) T4. Recoverable axis labels: MI-Imp and Synergy.]

Figure 9: Existence of T2-T4 patterns in two representative datasets: (a)-(c) M-Survival (SNP);
(d)-(f) Sonar (UCI). In subfigures (c) and (f), synergy is only computed for those patterns that
have positive improvement.

For each set of discovered patterns (only closed itemsets), we apply the criteria of the
three types of discriminative patterns (T2-T4) presented in Section 2 and obtain the number
of patterns of each type. Figure 9 illustrates the existence of T2, T3 and T4 discriminative
patterns in the representative SNP dataset (subfigures (a)-(c)) and the representative UCI
dataset (subfigures (d)-(f)). Note that a similar set of figures for the gene expression
dataset can be found in Section 2, i.e., Figures 3(a), 6(a) and 6(b).
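
The restriction to closed itemsets and the minimum-support threshold mentioned above can be illustrated on toy data. The sketch below is a brute-force enumeration intended only to show what "frequent" and "closed" mean. It is not the SMP algorithm or the FPC implementation used in the experiments, and the function name and toy transactions are hypothetical.

```python
from itertools import combinations


def closed_frequent_itemsets(transactions, minsup):
    """Return {itemset: support} for all closed itemsets with support >= minsup.

    An itemset is closed if no proper superset has the same support.
    Brute force: only suitable for a handful of items.
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))
    counts = {}
    for k in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, k):
            itemset = frozenset(combo)
            count = sum(1 for t in transactions if itemset <= t)
            if count / n >= minsup:
                counts[itemset] = count
                found = True
        if not found:
            break  # anti-monotonicity: no larger itemset can be frequent
    closed = {}
    for itemset, count in counts.items():
        has_equal_superset = any(
            itemset < other and count == counts[other] for other in counts
        )
        if not has_equal_superset:
            closed[itemset] = count / n
    return closed


# Hypothetical transactions over three items.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "b"}, {"c"}]
for itemset, sup in closed_frequent_itemsets(transactions, minsup=0.5).items():
    print(sorted(itemset), sup)
```

In this toy example {a} and {b} are frequent but not closed, because {a, b} occurs in exactly the same transactions. Keeping only closed itemsets removes such redundant sub-patterns.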



Several observations about each type of interaction can be made from Table 1 and Figure 9.

1: T2 discriminative patterns are common in most UCI datasets and the gene
expression dataset, but not in the SNP datasets: On the one hand, this indicates that
the UCI datasets and the gene expression dataset have features that are discriminative and
correlated with each other. On the other hand, the fact that the SNP datasets do not have
T2 patterns indicates that the discriminative SNPs are not correlated with each other. In
addition, column 7 in Table 1 indicates that the across-pattern redundancy is high in T2
patterns, i.e., the number of unique items is generally much smaller than the number of
patterns.

2: T3 discriminative patterns exist in about half of the UCI datasets and all
three of the biological datasets: These datasets are expected to contain discriminative
features that complement each other, improving the discriminative power of the pattern as a
whole. In contrast, in the other datasets, which have very few or no T3 discriminative
patterns, the discriminative features, if they exist in the dataset, are expected either to be
correlated with each other (T2), as in Cleve (UCI), or simply not to form interesting feature
combinations (independent association with the class variable), as in Mushroom and Waveform.
The fact that the three biological datasets have many T3 discriminative patterns is consistent
with the common knowledge that complex diseases involve the cooperation of multiple genes.
This is especially true for the two SNP datasets, which have no T2 discriminative
patterns but many T3 discriminative interactions. In addition, column 8 shows that the
across-pattern redundancy in T3 patterns is lower than in T2 patterns, because the number
of unique items is generally similar to the number of patterns.

3: T4 discriminative patterns exist in all three of the biological datasets
and only one UCI dataset (Sonar): First, T4 patterns are rare because T4 is based on the
most restrictive type of interaction (synergy). Nevertheless, the fact that the gene expression
and SNP datasets contain many T4 discriminative patterns indicates the relatively higher
complexity of genetic datasets compared to the common UCI datasets. In addition, column
9 shows that the across-pattern redundancy in T4 patterns is similar to that in T3 patterns,
and both are lower than in T2 patterns.

4: The number of T2-T4 patterns is much smaller than the overall
number of discriminative patterns: The last column in Table 1 shows the fraction of
discriminative patterns that are either T2, T3 or T4 (the three interesting types). Except
for the two SNP datasets, the fractions are generally very low, which indicates that many
patterns with good discriminative power are not interesting from the perspective
of the interestingness considered in this paper. The extreme cases are the Mushroom and
Waveform datasets, which do not contain any of the three types of patterns. This indicates
that the discriminative features in these two datasets are neither correlated with each other
nor complementary to each other, i.e., they are independently discriminative features. This
observation indicates that the actual number of interesting patterns is much more manageable
than the huge number of patterns that are generally encountered without a detailed
characterization.

5: T2-T4 patterns generally have smaller sizes than the entire set of
discriminative patterns: From the color of the circles in the figures, T2-T4 patterns
are generally of size 2-6. This is in contrast to the wider range of sizes for the entire set
of discriminative patterns, which can be as high as 14. This agrees with the observations
made in recent work on constraint-based generation of high-order discriminative patterns
[36]. Specifically, Steinbach et al. observed that the larger an itemset becomes, the
harder it is for the itemset to meet the constraints for a discriminative pattern when the
constraints are not only on the discriminative power of the pattern but also on the
improvement of the discriminative power. This also suggests that the computational complexity
of discriminative pattern mining could be less than expected, given that overly large patterns
tend to be uninteresting in terms of the meaningful relationships scoped in this paper.

6: T2-T4 patterns discovered from all the datasets are statistically significant:
In columns 7-9 of Table 1, all the T2-T4 patterns are statistically significant with
FDR < 0.01 after correcting for multiple hypothesis tests to control type I error (method
discussed in Section 2.5). Specifically, for the three biological datasets, the characterization
of those statistically significant gene or SNP combinations can assist further biological
interpretation and reveal novel insights into the mechanisms of complex diseases.
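
The across-pattern redundancy referred to in observations 1-3 compares the number of patterns with the number of unique items in their union (the bracketed numbers in Table 1). A minimal sketch of that count follows; the item identifiers and the helper name are made up for illustration.

```python
def unique_item_count(patterns):
    """Number of distinct items in the union of a collection of itemsets."""
    return len(set().union(*patterns)) if patterns else 0


# Hypothetical itemsets: heavily overlapping versus fully disjoint.
overlapping = [{"g1", "g2"}, {"g1", "g3"}, {"g2", "g3"}, {"g1", "g2", "g3"}]
disjoint = [{"s1", "s2"}, {"s3", "s4"}]

for name, patterns in [("overlapping", overlapping), ("disjoint", disjoint)]:
    print(f"{name}: {len(patterns)} patterns, {unique_item_count(patterns)} unique items")
```

Many patterns built from few unique items (as in the first toy set) indicate high redundancy, whereas a unique-item count close to or above the number of patterns indicates low redundancy.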

The above comprehensive observations illustrate the existence, characteristics and sta- 
tistical significance of the different types of patterns. They also illustrate how the proposed 
framework can provide novel insights into discriminative pattern mining and the discrimina- 
tive pattern structure of different datasets. 

4 Related Work 

Over the past decade, many approaches have studied discriminative patterns and related
topics. The most relevant related work was discussed earlier in Section 1. Among other work
focusing on mining discriminative patterns, the most relevant are [21, 30, ...]. Many
existing approaches have also used discriminative patterns for classification [26, 22, ...].
Additional related papers in the area include [28, ...]. We also refer the reader
to a comprehensive survey on discriminative patterns by Novak et al. [32].

5 Conclusion 

In this paper, we categorized discriminative patterns into four groups based on item inter- 
actions: (i) driver-passenger, (ii) coherent, (iii) independent additive and (iv) synergistic 
beyond independent addition. The coherent, additive, and synergistic patterns are of prac- 
tical importance, with the latter two representing a gain in the discriminative power of a 
pattern over its subsets. Synergistic patterns are most restrictive, but perhaps the most 
interesting since they capture a cooperative effect that is more than the sum of the effects 
of the individual items in the pattern. The experiments provided a number of insights into 
the nature of discriminative patterns in various real datasets and the characteristics of the 
different types of patterns. 

Particularly worth noting is that all T2-T4 patterns were significant in all the
datasets for which we evaluated pattern significance. While this needs to be investigated
further, we believe that this is mostly due to the pruning of a large number of patterns that
are not likely to be of interest. Without such pruning, the number of patterns is typically very
large, as is typical in most types of association analysis, and thus the significance of the
resulting patterns tends to be low (i.e., the FDR tends to be high) unless the patterns are
very strong, since the FDR depends very heavily on the number of patterns being considered.
We are hopeful that this observation will allow discriminative pattern mining to be used more
effectively for a wide variety of applications, both in the biomedical area and beyond.

Several further directions can be explored in the future. (1) The four types of patterns
defined in the paper are mainly based on mutual information as the building-block measure,
to make the presentation consistent and easy to follow; other statistical measures can
also be explored as building-block measures or specifically for a certain type of pattern.
For instance, the logistic regression-based measure studied in [37] can be leveraged as an
alternative to synergy. (2) Other types of interactions can be explored, especially those that
may be interesting in specific domains but are considered non-interesting in the context of
this paper. (3) From the computational perspective, it is also interesting to design mining
algorithms that can directly search for a particular type of discriminative pattern, which is
expected to be much faster given a more specific definition whose anti-monotonic properties
can be leveraged as additional pruning constraints.